Pro-cycling team cyclist assignment for an upcoming race

Professional bicycle racing is a popular sport that has attracted significant attention in recent years. The evolution and ubiquitous use of sensors allows cyclists to measure many metrics, including power, heart rate, speed, and cadence, in both training and racing. In this paper we explore, for the first time, the assignment of a subset of a team's cyclists to an upcoming race. We introduce RaceFit, a model that recommends cyclists for participation in an upcoming race based on their recent workouts and past assignments. RaceFit consists of binary classifiers trained on cyclist-race pairs, each described by relevant properties (features): the cyclist's demographic attributes and features extracted from his workout data in recent weeks, together with properties of the race such as its distance, elevation gain, and more. Two main approaches are introduced: recommending for each stage of a race and aggregating the stage-level results to the race, or recommending directly for the entire race. Model training is based on binary labels that represent a cyclist's participation in a race (or a stage) in past events. We evaluated RaceFit rigorously on a large dataset of cyclist and race data from three pro-cycling teams, achieving up to 80% precision@i. The first experiment showed that using TP or STRAVA data performs equally well. The best-performing configuration of the framework uses a 5-week time window, imputation, and the CatBoost classifier. However, with any of the parameter settings, the model always performed better than the baselines, in which cyclists are assigned based on their popularity in historical data. Additionally, we present the top-ranked predictive features.
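To make the cyclist-race pairing concrete, the following is a minimal sketch (not the authors' implementation) of building labeled cyclist-race rows and fitting a CatBoost classifier; the data frames and column names (`cyclist_id`, `race_id`) are assumptions for illustration, and features are assumed to be numeric.

```python
# Minimal sketch of the cyclist-race pairing described above.
# The data frames and column names are illustrative assumptions,
# not the exact feature set or pipeline used in the paper.
import pandas as pd
from catboost import CatBoostClassifier

def build_pairs(cyclists: pd.DataFrame, races: pd.DataFrame,
                participation: pd.DataFrame) -> pd.DataFrame:
    """Cross every cyclist with every race and attach a binary participation label."""
    pairs = cyclists.merge(races, how="cross")
    pairs = pairs.merge(participation.assign(label=1),
                        on=["cyclist_id", "race_id"], how="left")
    pairs["label"] = pairs["label"].fillna(0).astype(int)
    return pairs

pairs = build_pairs(cyclists, races, participation)
X = pairs.drop(columns=["label", "cyclist_id", "race_id"])  # numeric features assumed
y = pairs["label"]

model = CatBoostClassifier(iterations=500, verbose=False)
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]  # ranking score per cyclist-race pair
```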

Since the previous experiments showed that performance was quite similar whether using TP or STRAVA data, we extend our experiments to two more pro-cycling teams, Jumbo-Visma and Groupama-FDJ, whose TP data was not accessible. We describe here the results of the cyclist-race model applied to each of these teams, together with their average, to demonstrate the general performance. Fig. 1 presents the performance of recent workouts over multiple time windows, one curve per window, using the cyclist-race model, in comparison to the baselines. The charts for all the teams behave quite similarly: for IPT (Fig. 1a) the 7-week window performs noticeably better than the others, for Groupama-FDJ (Fig. 1b) the 3-week window performs best, and for Jumbo-Visma (Fig. 1c) and on average (Fig. 1d) performance is roughly the same for all time windows. In all charts, RaceFit outperforms the popularity baselines. Note that these results are averaged over multiple modeling choices, including whether or not imputation and data removal are used, as well as several classifiers whose individual results are relatively weak; nevertheless, they show the overall effect of using several weeks of workout data. In the later experiments, we show the results with the best-performing values.
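As an illustration of the time-window choice compared in Fig. 1, the sketch below summarizes a cyclist's workouts from the weeks preceding a race into a small feature vector; the workout columns and the particular aggregates are assumptions, not the paper's exact feature set.

```python
# Illustrative time-window feature extraction: aggregate the workouts a cyclist
# recorded in the `weeks` preceding the race date. Column names are assumed.
import pandas as pd

def window_features(workouts: pd.DataFrame, cyclist_id: int,
                    race_date: pd.Timestamp, weeks: int = 5) -> pd.Series:
    start = race_date - pd.Timedelta(weeks=weeks)
    recent = workouts[(workouts["cyclist_id"] == cyclist_id)
                      & (workouts["date"] >= start)
                      & (workouts["date"] < race_date)]
    return pd.Series({
        "n_workouts": len(recent),
        "total_distance_km": recent["distance_km"].sum(),
        "mean_power_w": recent["avg_power_w"].mean(),
        "total_elevation_m": recent["elevation_gain_m"].sum(),
    })
```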
Fig. 2 shows the performance with and without imputation; using imputation is clearly better for all the teams. However, even without imputation, the performance is still better than that of the popularity baselines.
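A hedged sketch of the missing-value handling behind Fig. 2 (and of the dropping threshold discussed later) follows: columns whose missing ratio exceeds the threshold are removed and the remaining gaps are imputed. Median imputation is an assumption here; the paper's exact imputer may differ.

```python
# Drop columns with too many missing values, then impute the rest.
# The median strategy is an assumption for this sketch (numeric features only).
import pandas as pd
from sklearn.impute import SimpleImputer

def preprocess(X: pd.DataFrame, drop_threshold: float = 0.4) -> pd.DataFrame:
    keep = X.columns[X.isna().mean() <= drop_threshold]
    X = X[keep]
    imputer = SimpleImputer(strategy="median")
    return pd.DataFrame(imputer.fit_transform(X), columns=keep, index=X.index)
```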
Fig. 3 shows the performance of the evaluated classifiers, all of which perform better than the baselines, with CatBoost performing meaningfully better on both metrics. Random Forest is second best after CatBoost for all teams. For IPT, the third-best classifier is Logistic Regression, followed by Decision Tree, whereas for Groupama-FDJ, Jumbo-Visma, and on average across all teams, the Decision Tree classifier is slightly better than Logistic Regression.
Fig. 4 shows recall@(n + k), where n is the number of cyclists required to participate in an upcoming race and k is the number of spare cyclists the system recommends in addition. When requesting exactly the required number of cyclists (k = 0), CatBoost correctly recommends about 50% of the cyclists who in fact participated in the race. In all charts, the CatBoost and Random Forest classifiers perform similarly. In Fig. 4a, Logistic Regression performs better than the Decision Tree for all k; in Fig. 4b, the Decision Tree performs better than Logistic Regression for most values of k; and in Figs. 4c and 4d the same holds, but only for values of k smaller than four.
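For clarity, recall@(n + k) can be computed as in the short sketch below: rank the cyclists by model score, take the top n + k, and measure the fraction of actual starters recovered. The helper function and its toy inputs are illustrative.

```python
# recall@(n+k): fraction of cyclists who actually raced that appear in the
# top n+k recommendations (n = required riders, k = spares).
def recall_at_n_plus_k(scores: dict, actual: set, n: int, k: int) -> float:
    ranked = sorted(scores, key=scores.get, reverse=True)
    recommended = set(ranked[: n + k])
    return len(recommended & actual) / len(actual)

# Toy example: 3 of the 4 actual starters appear in the top n+k list.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
print(recall_at_n_plus_k(scores, {"a", "b", "c", "e"}, n=4, k=0))  # 0.75
```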
Fig. 5 shows the performance of the parameters that perform best on average across all teams, based on the previous analyses. The best time window on average is 3 weeks, the best threshold is the 40% missing-value dropping ratio, the use of imputation improves the results, and CatBoost outperforms the other classifiers. These average-best parameters are not identical to the best-performing parameters for each individual team. Fig. 5a, for example, shows the results of IPT, whose best-performing parameters are a time window of 7 weeks and a threshold of 60%. In Fig. 5c, for Jumbo-Visma, the best-performing time window is 7 weeks. For Groupama-FDJ, the best-performing parameters are identical to those chosen on average. Although the differences between parameter values are not very meaningful in most of the previous analyses, in the next figure we present the performance of each team with its own best parameter values.
Fig. 6 presents the model with the best-performing parameter values for each team individually. For IPT, the best time window is 7 weeks, while for Jumbo-Visma it is the 5-week window. The best time window for Groupama-FDJ is 3 weeks, as it is on average for all the teams. The threshold in Figs. 6b and 6c is 40%, matching the threshold chosen in the average-result graphs; in the IPT experiments, however, the best-performing threshold is 60%.
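The modeling choices compared throughout this section (time window, dropping threshold, imputation, classifier) can be viewed as a small parameter grid; the sketch below mirrors the values reported in the figures, while the search loop itself is an assumption of this illustration.

```python
# Illustrative parameter grid over the modelling choices compared in Figs. 1-6.
from itertools import product
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

grid = {
    "window_weeks": [3, 5, 7],
    "drop_threshold": [0.4, 0.6],
    "use_imputation": [True, False],
    "classifier": [CatBoostClassifier(verbose=False), RandomForestClassifier(),
                   LogisticRegression(max_iter=1000), DecisionTreeClassifier()],
}

for window, threshold, impute, clf in product(*grid.values()):
    ...  # build window features, preprocess with threshold/impute, fit and evaluate clf
```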

Experiment 4: Cyclist-stage, all teams on STRAVA data
We describe here the results of the cyclist-stage experiment on the various teams' STRAVA data. Fig. 7 presents the performance of the different time windows (in weeks). For all the teams, a 5-week time window yields the best performance. For IPT, the next-best time window is 3 weeks, while for the other teams, and on average, the 3-week window performs worse than the 7-week window. For all time windows, RaceFit performs better than the popularity baselines. Fig. 8 shows the advantage of using imputation: for all teams, imputation improves performance, yet even without it the performance is better than the popularity baselines.
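As described earlier, the cyclist-stage variant scores each (cyclist, stage) pair and aggregates the stage-level scores to a race-level ranking. The sketch below uses the mean of the stage probabilities as the aggregation; this rule is an assumption of the illustration, not necessarily the paper's exact aggregation.

```python
# Aggregate stage-level probabilities to a race-level score per cyclist.
# Averaging the stage scores is an assumed aggregation rule.
import numpy as np

def race_scores_from_stages(model, stage_features_by_cyclist: dict) -> dict:
    """stage_features_by_cyclist maps cyclist_id -> 2-D array, one row per stage."""
    return {
        cyclist: float(np.mean(model.predict_proba(stage_rows)[:, 1]))
        for cyclist, stage_rows in stage_features_by_cyclist.items()
    }
```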
Fig. 9 shows the results of the evaluated classifiers. For all the teams, the CatBoost classifier outperforms the others, although the performance for IPT (Fig. 9a) is much better than for the other teams. On average, and in Figs. 9b and 9c, Random Forest is the next-best-performing classifier, while for IPT (Fig. 9a) Random Forest performs similarly to Logistic Regression.
Fig. 10 shows recall@(n + k) for the cyclist-stage version of RaceFit, where n is the number of cyclists actually required for the race and k is the number of spare cyclists recommended in addition, for the coach to choose from if needed. The CatBoost classifier yields the best performance for all of the teams, followed by Random Forest. For IPT, Logistic Regression follows Random Forest, and then Decision Tree, for every k, whereas for Groupama-FDJ, Decision Tree performs slightly better. In Fig. 10c, for Jumbo-Visma, Decision Tree performs somewhat better for most values of k, while for k greater than 4 Logistic Regression performs better. For all the teams, the baselines perform worse than RaceFit.
Fig. 11 presents the best-performing parameters on average over all teams: a 5-week window, the 60% missing-value dropping ratio as the threshold, imputation to fill missing values, and CatBoost as the classifier. The only team with a different best-performing parameter is Groupama-FDJ, whose best threshold is 40%. RaceFit with the best parameters for each of the teams, and their average, is much better than the popularity baselines.
Fig. 12 shows the results of the best parameters chosen for each team individually. The best time window for all the teams is the 5-week window. The best threshold is 60% for all the teams except Groupama-FDJ (Fig. 12b), for which it is 40%. The best classifier for all teams is CatBoost, and imputation was effective for all teams as well. Comparing these results with Fig. 11, which uses the parameters that work best for all teams on average, the performance is quite similar; it is therefore reasonable to use the parameter values that work best on average across teams.
Fig. 1. Comparison of different workout time-window lengths (in weeks) for each team, and their average performance. The different time windows perform quite similarly, and all outperform the baselines.
Fig. 2. The use of imputation clearly performs better than no imputation.
Fig. 3. CatBoost performs best in all cases, and RaceFit performs better than the baselines.
Fig. 4. Classifier comparison for the different teams. CatBoost performs best; in all cases, RaceFit performs better than the baselines.
Fig. 5. RaceFit with the best parameters for each of the teams, and their average, is much better than the popularity baselines.
Fig. 6. Comparison of the best parameters for each team individually.
Fig. 8. RaceFit with and without imputation. The use of imputation improves the results for all the teams.